This course teaches students to use the statistical programming software R to prepare and analyze data.
This week I learned to set up, perform and interpret linear regression with multiple explanatory variables.
lrn14<-read.csv("learning2014.csv",row.names=1)
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166 7
This dataset includes 166 rows and 7 columns (variables). The variables are Age, Gender, (exam) Points, Attitude, Surface Learning, Strategic Learning and Deep Learning.
install.packages("GGally")
install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
pairs(lrn14[-1],col=lrn14$gender)
ggpairs(lrn14, mapping=aes(col=gender, alpha=0.3), lower=list(combo=wrap("facethist",bins=20)))
summary(lrn14)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Most of the variables have relatively normal distributions, with the exception of age, which has a right-skewed distribution. The male attitude distribution could also be considered left-skewed. Most of the variables have weak relationships with each other, as exhibited by absolute correlation values below 0.1. Notably, points and attitude have a strong positive relationship with a correlation value of 0.437. Deep learning and surface learning have a fairly strong negative correlation at -0.324, meaning that as deep learning increases, the tendency toward surface learning decreases. Surprisingly, median deep learning is higher than median surface learning, although there is higher variation with regard to deep learning.
my_model<-lm(Points~attitude+deep+surf, data=lrn14)
summary(my_model)
##
## Call:
## lm(formula = Points ~ attitude + deep + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
The only tested explanatory variable that exhibits a statistically significant relationship with the target variable, points, is attitude, as indicated by a p-value well below 0.05. For every 1-point increase in a student's measured attitude, predicted exam points increase by approximately 3.47 (the coefficient estimate; the t-value of 6.011 tests this slope against zero). The relationships between the explanatory variables deep learning and surface learning and the target variable points (exam score) are not significantly different from 0.
Ordinary least squares regression determines the betas, the coefficients of the best-fit line, that minimize the sum of squared differences between the fitted values and the observed values. The Estimate column gives the slope associated with each explanatory variable, and thus the direction and size of its relationship with the target variable. Each t-value is the estimate divided by its standard error, and the corresponding p-value indicates whether that relationship is statistically significant or plausibly due to chance.
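As a sketch of what lm() does under the hood, the betas can be computed directly from the normal equations, beta = (X'X)^-1 X'y. The example below uses the built-in mtcars data (mpg, wt, hp) purely for illustration, since learning2014.csv is assumed unavailable here:

```r
# OLS by hand via the normal equations, illustrated on built-in mtcars data
X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # design matrix incl. intercept
y <- mtcars$mpg
beta <- solve(t(X) %*% X, t(X) %*% y)            # solve (X'X) beta = X'y

# lm() produces the same coefficients:
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(as.numeric(beta), as.numeric(coef(fit)))  # TRUE
```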
my_model2<-lm(Points~attitude,data=lrn14)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Removing the explanatory variables surface learning and deep learning, which did not exhibit a statistically significant relationship with the target variable points, attitude remains statistically significant and its estimated effect on points grows slightly, from 3.4661 to 3.5255. The intercept estimate falls from 18.36 to 11.64, while its t-value rises from 3.895 to 6.358. The median residual increases from the prior model, and the multiple R-squared value falls from 0.2024 to 0.1906, meaning that the second model has slightly less explanatory power than the first.
R-squared is a measure of how close the true observations are to the fitted line, and thus of how well the model explains the variation of observations around the data's mean. It tells you how much of the variance in the target variable can be explained by the explanatory variables. R-squared values range from 0 to 1, with 0 indicating that the model explains none of the variation, and 1 indicating that the explanatory variables explain all of the variation in the target variable. My first model has an R-squared of 0.2024, meaning that it has relatively low explanatory power and there are substantial differences between the fitted values and the actual observations.
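The R-squared printed by summary() can be reproduced from its definition, R² = 1 − SS_res/SS_tot. Again sketched with the built-in mtcars data for illustration:

```r
# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
fit <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared)  # TRUE
```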
plot(my_model2, which=c(1:2,5))
The assumptions of the linear regression model are that the errors are normally distributed, the errors are not correlated and the errors have constant variance (and, as such, the size of an error does not depend on the explanatory variables). For my model, the normality assumption is reasonable, because the points are for the most part tightly fitted to the line in the Q-Q Plot and there are no extreme outliers. The constant variance assumption is also reasonable, because there is a random spread of points with no clear pattern in the graph comparing residuals and model predictions. The Residuals vs. Leverage plot indicates that the data has regular leverage, because there is no significant outlier (the leverage values range from 0 to 0.04) that has a disproportionate effect on the model.
Introduction to logistic regression and cross validation
alc <- read.csv("alc.csv",row.names=1)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 382 35
The data includes 382 rows (students) and 35 variables/columns. Each variable indicates attributes and grade information about students in either math or Portuguese language courses.
I will test Pstatus, studytime, failures and higher in relation to high/low alcohol consumption. My hypotheses: students whose parents live apart (Pstatus = A) will have a higher probability of high alcohol consumption; students with higher weekly study times will have a lower probability of high alcohol consumption; students with more past class failures will have a higher probability of high alcohol consumption; and students who want to pursue higher education will have a lower probability of high alcohol consumption.
install.packages("GGally")
install.packages("ggplot2")
library(ggplot2)
library(GGally)
bar<-ggplot(data=alc, aes(x=high_use))
bar+facet_wrap("Pstatus")+geom_bar()
bar+facet_wrap("studytime")+geom_bar()
bar+facet_wrap("failures")+geom_bar()
bar+facet_wrap("higher")+geom_bar()
table(high_use=alc$high_use, Pstatus=alc$Pstatus)
## Pstatus
## high_use A T
## FALSE 26 242
## TRUE 12 102
table(high_use=alc$high_use, studytime=alc$studytime)
## studytime
## high_use 1 2 3 4
## FALSE 58 135 52 23
## TRUE 42 60 8 4
table(high_use=alc$high_use, failures=alc$failures)
## failures
## high_use 0 1 2 3
## FALSE 244 12 10 2
## TRUE 90 12 9 3
table(high_use=alc$high_use, higher=alc$higher)
## higher
## high_use no yes
## FALSE 9 259
## TRUE 9 105
Based on the bar graphs and cross-tabulations, my hypotheses generally seem to be appropriate. For students whose parents live together, a higher proportion have low alcohol consumption compared with students whose parents are separated. Generally, the more students study the more likely they are to have low alcohol consumption. For students who plan to pursue higher education, a higher proportion have low alcohol consumption compared to students who don’t plan to pursue higher education. In terms of failures, however, there is a less clear relationship with alcohol consumption.
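These visual impressions can be checked numerically by converting the cross-tabulated counts into within-group proportions. As a sketch, using the Pstatus counts from the table above:

```r
# Within-group proportions of high alcohol use, built from the
# cross-tabulated counts printed above
tab <- matrix(c(26, 242,
                12, 102),
              nrow = 2, byrow = TRUE,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              Pstatus  = c("A", "T")))
props <- prop.table(tab, margin = 2)  # divide each column by its total
round(props, 3)
# high-use share is slightly higher when parents live apart (column A)
```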
model<-glm(high_use~Pstatus+studytime+failures+higher, data=alc, family="binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ Pstatus + studytime + failures + higher,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5570 -0.7981 -0.7981 1.3000 2.0759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.41983 0.68428 0.614 0.539520
## PstatusT -0.08093 0.37705 -0.215 0.830043
## studytime -0.52540 0.15683 -3.350 0.000808 ***
## failures 0.34848 0.19764 1.763 0.077866 .
## higheryes -0.26871 0.53569 -0.502 0.615936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 443.89 on 377 degrees of freedom
## AIC: 453.89
##
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept) PstatusT studytime failures higheryes
## 0.41983005 -0.08093225 -0.52540334 0.34848048 -0.26871220
The only statistically significant predictor of high alcohol consumption at the 0.05 level is study time. For every one-unit increase in study time, a student's log-odds of exhibiting high alcohol consumption decrease by 0.52540.
install.packages("tidyr")
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.2
OR<-coef(model)%>%exp
CI<-exp(confint(model))
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.5217029 0.3876426 5.7868685
## PstatusT 0.9222562 0.4488646 1.9917762
## studytime 0.5913168 0.4308370 0.7978098
## failures 1.4169129 0.9625861 2.1018616
## higheryes 0.7643632 0.2675755 2.2483346
Pstatus is associated with almost 1:1 odds of high alcohol consumption, but the odds of success are slightly lower for students whose parents are together, meaning that PstatusT is weakly negatively associated with success (high alcohol consumption). Study time is associated with approximately 3:5 odds, indicating that students who study more have lower odds of exhibiting high alcohol consumption. Failures is associated with approximately 3:2 odds, meaning the odds of success are higher for students with more failures, so failure is positively associated with high alcohol consumption. Higher education exhibits roughly 3:4 odds, indicating that the desire to pursue higher education is negatively associated with high alcohol consumption. However, only the confidence interval for study time (0.43 to 0.80) excludes 1.0; the intervals for PstatusT, failures and higheryes all contain 1.0, so study time is the only explanatory variable with a statistically significant association with the target variable (high alcohol consumption) at the p<0.05 level.
Overall, the relationships posited in my hypothesis were correct, but only study time exhibited a statistically significant relationship with alcohol consumption in the regression model.
model2<-glm(high_use~studytime, data=alc, family="binomial")
summary(model2)
##
## Call:
## glm(formula = high_use ~ studytime, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0603 -0.8314 -0.8314 1.2993 2.1010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3209 0.3076 1.043 0.297
## studytime -0.6029 0.1530 -3.941 8.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 448.31 on 380 degrees of freedom
## AIC: 452.31
##
## Number of Fisher Scoring iterations: 4
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
probabilities<-predict(model2, type="response")
alc<-mutate(alc, probability=probabilities)
alc<-mutate(alc, prediction=probability>0.5)
table(high_use=alc$high_use, prediction=alc$prediction)
## prediction
## high_use FALSE
## FALSE 268
## TRUE 114
select(alc, studytime, high_use, probability, prediction) %>% head(20)
## studytime high_use probability prediction
## 1 2 FALSE 0.2921823 FALSE
## 2 2 FALSE 0.2921823 FALSE
## 3 2 TRUE 0.2921823 FALSE
## 4 3 FALSE 0.1842725 FALSE
## 5 2 FALSE 0.2921823 FALSE
## 6 2 FALSE 0.2921823 FALSE
## 7 2 FALSE 0.2921823 FALSE
## 8 2 FALSE 0.2921823 FALSE
## 9 2 FALSE 0.2921823 FALSE
## 10 2 FALSE 0.2921823 FALSE
## 11 2 FALSE 0.2921823 FALSE
## 12 3 FALSE 0.1842725 FALSE
## 13 1 FALSE 0.4299752 FALSE
## 14 2 FALSE 0.2921823 FALSE
## 15 3 FALSE 0.1842725 FALSE
## 16 1 FALSE 0.4299752 FALSE
## 17 3 FALSE 0.1842725 FALSE
## 18 2 FALSE 0.2921823 FALSE
## 19 1 TRUE 0.4299752 FALSE
## 20 1 FALSE 0.4299752 FALSE
Interestingly, using the model with only study time as an explanatory variable, there are no predicted probabilities above 0.5 and thus no predictions of TRUE (high alcohol consumption).
library(ggplot2)
g<-ggplot(alc, aes(x=probability, y=high_use, col=prediction))
g+geom_point()
loss_func<-function(class,prob){
n_wrong<-abs(class-prob)>0.5
mean(n_wrong)
}
loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2984293
The training error is approximately 0.3, meaning that about 30 percent of students are misclassified in terms of their alcohol consumption (high/low). This is exactly the error you would get from the trivial strategy of predicting low use for every student, which misclassifies only the 114 high-use students (114/382 ≈ 0.298). Since this model never predicts TRUE, its performance is no better than that simple majority-class guessing strategy.
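The majority-class baseline can be computed directly from the counts in the confusion table above:

```r
# Baseline strategy: predict FALSE (low use) for every student.
# The counts are taken from the confusion table printed above.
n_true  <- 114                             # students with high_use == TRUE
n_false <- 268                             # students with high_use == FALSE
baseline_error <- n_true / (n_true + n_false)
round(baseline_error, 7)                   # matches the model's training error
```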
install.packages("boot")
library(boot)
cv<-cv.glm(data=alc, cost=loss_func, glmfit=model, K=10)
cv$delta[1]
## [1] 0.3089005
cv<-cv.glm(data=alc, cost=loss_func, glmfit=model2, K=10)
cv$delta[1]
## [1] 0.2984293
Both models have a higher average prediction error than the DataCamp model. To find a better-performing model, you could test different numbers and combinations of explanatory variables.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset has 506 rows and 14 columns/variables. The variables describe factors that influence housing values in the suburbs of Boston.
install.packages("corrplot")
library(tidyr)
library(corrplot)
pairs(Boston)
cor_matrix<-cor(Boston)%>%round
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
## crim 1 0 0 0 0 0 0 0 1 1 0 0 0 0
## zn 0 1 -1 0 -1 0 -1 1 0 0 0 0 0 0
## indus 0 -1 1 0 1 0 1 -1 1 1 0 0 1 0
## chas 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## nox 0 -1 1 0 1 0 1 -1 1 1 0 0 1 0
## rm 0 0 0 0 0 1 0 0 0 0 0 0 -1 1
## age 0 -1 1 0 1 0 1 -1 0 1 0 0 1 0
## dis 0 1 -1 0 -1 0 -1 1 0 -1 0 0 0 0
## rad 1 0 1 0 1 0 0 0 1 1 0 0 0 0
## tax 1 0 1 0 1 0 1 -1 1 1 0 0 1 0
## ptratio 0 0 0 0 0 0 0 0 0 0 1 0 0 -1
## black 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## lstat 0 0 1 0 1 -1 1 0 0 1 0 0 1 -1
## medv 0 0 0 0 0 1 0 0 0 0 -1 0 -1 1
corrplot(cor_matrix, method="circle",type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6 )
According to the correlation plot, there appear to be strong relationships between a number of the variables in the dataset. For example, there is a strong, positive relationship between per capita crime rates and property tax rates. There is a strong, negative relationship between property tax rates and average distances to five Boston employment centers. Both of these relationships seem counterintuitive.
boston_scaled<-scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Scaling standardizes each variable: the column mean is subtracted from every value and the result is divided by the column's standard deviation, so each variable now has mean 0 and a much smaller range centered on zero.
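What scale() does can be reproduced by hand on a small toy vector:

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```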
boston_scaled<-as.data.frame(boston_scaled)
scaled_crim<-boston_scaled$crim
quants<-quantile(scaled_crim)
summary(quants)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 1.742000 0.007389 9.924000
string<-c("low", "med_low", "med_high", "high")
crime<-cut(scaled_crim, breaks=quants, include.lowest=TRUE, label=string)
summary(crime)
## low med_low med_high high
## 127 126 126 127
boston_scaled<-dplyr::select(boston_scaled, -crim)
boston_scaled<-data.frame(boston_scaled, crime)
n<-nrow(boston_scaled)
samp<-sample(n, size=n*0.8)
train<-boston_scaled[samp,]
test<-boston_scaled[-samp,]
lda.fit<-lda(crime~.,data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2425743 0.2475248 0.2673267
##
## Group means:
## zn indus chas nox rm
## low 0.8918285 -0.8903188 -0.07145661 -0.8620293 0.44811803
## med_low -0.1022210 -0.3673196 -0.07145661 -0.6091149 -0.10784998
## med_high -0.3997708 0.2382853 0.08200995 0.4192778 -0.07558083
## high -0.4872402 1.0169921 -0.01714665 1.0721662 -0.38942384
## age dis rad tax ptratio
## low -0.8368194 0.8549838 -0.6947544 -0.7479877 -0.40224556
## med_low -0.3895617 0.4227369 -0.5330315 -0.4551314 -0.04638988
## med_high 0.4187416 -0.3669939 -0.4857334 -0.3461383 -0.24505633
## high 0.8227458 -0.8570216 1.6393984 1.5149640 0.78225547
## black lstat medv
## low 0.37861116 -0.7550901 0.54551885
## med_low 0.35372990 -0.2368598 0.01462858
## med_high 0.07725263 0.1083084 0.04992807
## high -0.81954991 0.8873683 -0.73427207
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.03780501 0.55040784 -1.16715497
## indus 0.05097584 -0.39820994 0.04549061
## chas -0.01542152 0.05697729 0.08077706
## nox 0.36211071 -0.89071847 -1.14690905
## rm 0.00925602 -0.01431969 -0.18083925
## age 0.04226551 -0.29149861 -0.04802726
## dis -0.08723518 -0.23289488 0.31408343
## rad 3.95020151 0.77566097 -0.29134568
## tax 0.12063765 0.40034231 0.93165960
## ptratio 0.13953841 -0.13059996 -0.31794043
## black -0.08460433 0.08152816 0.15685794
## lstat 0.19478836 -0.29391550 0.19615526
## medv 0.05213528 -0.46894942 -0.17203865
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9642 0.0273 0.0084
lda.arrows<-function(x, myscale=1, arrow_heads=0.1, color="red", tex=0.75, choices=c(1,2)){
heads<-coef(x)
arrows(x0=0, y0=0,
x1=myscale*heads[,choices[1]],
y1=myscale*heads[,choices[2]],
col=color, length=arrow_heads)
text(myscale*heads[,choices],labels=row.names(heads),
cex=tex,col=color, pos=3)
}
classes<-as.numeric(train$crime)
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale=1)
The low, med_low and med_high training data are all intermixed and spatially tight, while the high training data are mostly clustered together, separate from the other categories and spatially distant (with the exception of a few med_high points intermixed and one high point mixed into the other cluster). It appears as though the high predictions will be the most accurate (most likely because those training data are the most spatially distant from the others), followed by the low predictions. The med_low and med_high predictions seem as though they will be relatively inaccurate, and the categories could most likely be slimmed down to two or three classifications/clusters.
correct_classes<-test$crime
test<-dplyr::select(test,-crime)
lda.pred<-predict(lda.fit, newdata=test)
table(correct=correct_classes, predicted=lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 9 2 0
## med_low 6 15 7 0
## med_high 0 5 17 4
## high 0 0 0 19
The LDA model was most accurate at predicting the med_high and high results: it correctly predicted every high result (19 of 19). Low results were predicted correctly about 62 percent of the time (18 of 29), with the incorrect predictions mostly falling into med_low. Med_low results were correctly predicted about 54 percent of the time (15 of 28), with misclassifications ranging from low to med_high.
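The model's overall test accuracy follows from the confusion matrix above: the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total, roughly 0.68 here.

```r
# Overall accuracy from the confusion matrix printed above
conf <- matrix(c(18,  9,  2,  0,
                  6, 15,  7,  0,
                  0,  5, 17,  4,
                  0,  0,  0, 19),
               nrow = 4, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)  # correct / all test observations
round(accuracy, 3)  # 0.676
```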
library(MASS)
data("Boston")
boston_scaled<-scale(Boston)
dist_eu<-dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
km<-kmeans(dist_eu, centers=15)
pairs(Boston, col=km$cluster)
k_max<-15
twcss<-sapply(1:k_max, function(k){kmeans(dist_eu,k)$tot.withinss})
plot(1:k_max, twcss, type='b')
km2<-kmeans(dist_eu, centers=2)
pairs(Boston, col=km2$cluster)
Average number of rooms per dwelling (rm), nitrogen oxide concentration (nox) and proportion of owner-occupied units built prior to 1940 (age) seem to affect the clustering results.
library(MASS)
data("Boston")
boston_scaled<-scale(Boston)
dist_eu<-dist(boston_scaled)
km3<-kmeans(dist_eu, centers=4)
boston_scaled<-as.data.frame(boston_scaled)
lda.fit2<-lda(km3$cluster~.,data=boston_scaled)
lda.fit2
## Call:
## lda(km3$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.1304348 0.2272727 0.2114625 0.4308300
##
## Group means:
## crim zn indus chas nox rm
## 1 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469
## 2 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011
## 3 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426
## 4 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## age dis rad tax ptratio black
## 1 0.8575386 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038
## 2 0.7716696 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546
## 3 -0.6949417 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729
## 4 -0.3256000 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644
## lstat medv
## 1 1.1930953 -0.81904111
## 2 0.6116223 -0.54636549
## 3 -0.8200275 1.11919598
## 4 -0.2813666 -0.01314324
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.18113078 0.5012256 0.60535205
## zn -0.43297497 1.0486194 -0.67406151
## indus -1.37753200 -0.3016928 -1.07034034
## chas 0.04307937 0.7598229 0.22448239
## nox -1.04674638 0.3861005 0.33268952
## rm 0.14912869 0.1510367 -0.67942589
## age 0.09897424 -0.0523110 -0.26285587
## dis -0.13139210 0.1593367 0.03487882
## rad -0.65824136 -0.5189795 -0.48145070
## tax -0.28903561 0.5773959 -0.10350513
## ptratio -0.22236843 -0.1668597 0.09181715
## black 0.42730704 -0.5843973 -0.89869354
## lstat -0.24320629 0.6197780 0.01119242
## medv -0.21961575 0.9485829 0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
lda.arrows<-function(x, myscale=1, arrow_heads=0.1, color="red", tex=0.75, choices=c(1,2)){
heads<-coef(x)
arrows(x0=0, y0=0,
x1=myscale*heads[,choices[1]],
y1=myscale*heads[,choices[2]],
col=color, length=arrow_heads)
text(myscale*heads[,choices],labels=row.names(heads),
cex=tex,col=color, pos=3)
}
classes=as.numeric(km3$cluster)
plot(lda.fit2, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit2, myscale=3)
Property tax rates (tax), proportion of residential land zoned for large lots (zn) and nitrogen oxide concentration (nox) are the most influential linear separators for the clusters. These variables mostly influence the fourth cluster.